Pipeline package is available on CRI Bioinformatics FTP
The Center for Research Informatics (CRI) provides computational resources and expertise in biomedical informatics for researchers in the Biological Sciences Division (BSD) of the University of Chicago.
As a bioinformatics core, we are actively improving our pipelines and expanding pipeline functions. The tutorials will be updated in a timely manner but may not reflect the newest updates of the pipelines. Stay tuned with us for the latest pipeline release.
If you have any questions, comments, or suggestions, feel free to contact our core at bioinformatics@bsd.uchicago.edu or one of our bioinformaticians.
RNA sequencing (RNA-seq) is a revolutionary approach that uses the next-generation sequencing technologies to detect and quantify expressed transcripts in biological samples. Compared to other methods such as microarrays, RNA-seq provides a more unbiased assessment of the full range of transcripts and their isoforms with a greater dynamic range in expression quantification.
In this tutorial, you will learn how to use the CRI’s RNA-seq pipeline (available on both CRI HPC cluster and GitHub)) to analyze Illumina RNA sequencing data. The tutorial comprises the following Steps:
By the end of this tutorial, you will:
This tutorial is based on CRI’s high-performance computing (HPC) cluster. If you are not familiar with this newly assembled cluster, a concise user’s guide can be found here.
The RNA-seq data used in this tutorial are from CRI-BIO-646-BMB-RKeenan.
In this tutorial, we use the sequencing reads in the project CRI-BIO-646 in mouse as example. The sample information are saved in the file CRI-BIO-646.metadata.txt (see below).
Work Flow
There are six (partial) single-end RNA-seq sequencing libraries will be used as the example dataset In this tutorial. Their respective sample information is described in the metadata table CRI-BIO-646/CRI-BIO-646.metadata.txt.
| Sample | Library | ReadGroup | LibType | Platform | SequencingCenter | Date | Lane | Unit | Flavor | Encoding | Run | Genome | NucleicAcid | Group | Location | Seqfile1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WT01 | WT01 | BK-AA-1_S11_L007_R355 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 355 | hg38 | rnaseq | WT | CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ | BK-AA-1_S11_L007_R1_001.fastq.gz |
| WT02 | WT02 | BK-AA-2_S12_L007_R355 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 355 | hg38 | rnaseq | WT | CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ | BK-AA-2_S12_L007_R1_001.fastq.gz |
| WT03 | WT03 | BK-AA-3_S13_L007_R355 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 355 | hg38 | rnaseq | WT | CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ | BK-AA-3_S13_L007_R1_001.fastq.gz |
| KO01 | KO01 | BK-AA-4_S14_L007_R355 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 355 | hg38 | rnaseq | KO | CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ | BK-AA-4_S14_L007_R1_001.fastq.gz |
| KO02 | KO02 | BK-AA-5_S15_L007_R355 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 355 | hg38 | rnaseq | KO | CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ | BK-AA-5_S15_L007_R1_001.fastq.gz |
| KO03 | KO03 | BK-AA-6_S16_L007_R355 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 355 | hg38 | rnaseq | KO | CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ | BK-AA-6_S16_L007_R1_001.fastq.gz |
| WT01 | WT01 | BK-AA-1_S11_L007_R357 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 357 | hg38 | rnaseq | WT | CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ | BK-AA-1_S11_L007_R1_001.fastq.gz |
| WT02 | WT02 | BK-AA-2_S12_L007_R357 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 357 | hg38 | rnaseq | WT | CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ | BK-AA-2_S12_L007_R1_001.fastq.gz |
| WT03 | WT03 | BK-AA-3_S13_L007_R357 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 357 | hg38 | rnaseq | WT | CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ | BK-AA-3_S13_L007_R1_001.fastq.gz |
| KO01 | KO01 | BK-AA-4_S14_L007_R357 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 357 | hg38 | rnaseq | KO | CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ | BK-AA-4_S14_L007_R1_001.fastq.gz |
| KO02 | KO02 | BK-AA-5_S15_L007_R357 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 357 | hg38 | rnaseq | KO | CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ | BK-AA-5_S15_L007_R1_001.fastq.gz |
| KO03 | KO03 | BK-AA-6_S16_L007_R357 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 357 | hg38 | rnaseq | KO | CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ | BK-AA-6_S16_L007_R1_001.fastq.gz |
We will use SSH (Secure Shell) to connect to CRI’s HPC. SSH now is included or can be installed in all standard operating systems (Windows, Linux, and OS X).
The login procedure varies slightly depending on whether you use a Mac/Unix/Linux computer or a Windows computer.
Connect to the login node of the CRI HPC cluster:
$ ssh -l username@gardner.cri.uchicago.eduType in the password when prompted
Make sure that you replace
usernamewith your login name.
- CAUTION
- THIS PACKAGE IS LARGE, PLEASE DO NOT DOWNLOAD IT TO YOUR HOME DIRECTORY
- USE OTHER LOCATION LIKE /gpfs/data/bioinformatics/username
You should be in your home directory after logging in
$ pwd
/home/username
$ cd /gpfs/data/bioinformatics/username; pwd
/gpfs/data/bioinformatics/usernameInstead of downloading the pipeline package to your local home directory, use other location like /gpfs/data/bioinformatics/username
$ cd /gpfs/data/bioinformatics/username; pwd
/gpfs/data/bioinformatics/usernameOne way to download the pipeline package via git clone
$ cp ftp://logia.cri.uchicago.edu/bioinformatics/tutorials/Nov2018/CRI-BIO-646-BMB-RKeenan.tgz .Uncompress the tarball file
$ tar -zxvf CRI-BIO-646-BMB-RKeenan.tgzChange working directory to pipeline dirctory
$ cd CRI-BIO-646-BMB-RKeenan
$ tree -d
|-- CRI-BIO-646
| |-- data
| | |-- 180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10
| | | `-- FastQ
| | `-- 180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10
| | `-- FastQ
| `-- references
| `-- v28_92_GRCh38.p12
| `-- STAR
|-- SRC
| |-- Python
| | |-- lib
| | |-- module
| | `-- util
| `-- R
| |-- module
| `-- util
`-- docs
`-- IMGRaw sequencing data files (*.fastq.gz) are located at CRI-BIO-646/data/
$ tree CRI-BIO-646/data/
|-- 180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10
| `-- FastQ
| |-- BK-AA-1_S11_L007_R1_001.fastq.gz
| |-- BK-AA-2_S12_L007_R1_001.fastq.gz
| |-- BK-AA-3_S13_L007_R1_001.fastq.gz
| |-- BK-AA-4_S14_L007_R1_001.fastq.gz
| |-- BK-AA-5_S15_L007_R1_001.fastq.gz
| `-- BK-AA-6_S16_L007_R1_001.fastq.gz
`-- 180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10
`-- FastQ
|-- BK-AA-1_S11_L007_R1_001.fastq.gz
|-- BK-AA-2_S12_L007_R1_001.fastq.gz
|-- BK-AA-3_S13_L007_R1_001.fastq.gz
|-- BK-AA-4_S14_L007_R1_001.fastq.gz
|-- BK-AA-5_S15_L007_R1_001.fastq.gz
`-- BK-AA-6_S16_L007_R1_001.fastq.gzGenome data are located at /gpfs/data/bioinformatics/ReferenceData/cri_rnaseq_2018/vM18_93_GRCm38.p6
$ tree /gpfs/data/bioinformatics/cri_rnaseq_2018_ex/example/references/v28_92_GRCh38.p12
|-- GRCh38_rRNA.bed
|-- GRCh38_rRNA.bed.interval_list
|-- STAR
| |-- Genome
| |-- Log.out
| |-- SA
| |-- SAindex
| |-- chrLength.txt
| |-- chrName.txt
| |-- chrNameLength.txt
| |-- chrStart.txt
| |-- exonGeTrInfo.tab
| |-- exonInfo.tab
| |-- geneInfo.tab
| |-- genomeParameters.txt
| |-- run_genome_generate.logs
| |-- sjdbInfo.txt
| |-- sjdbList.fromGTF.out.tab
| |-- sjdbList.out.tab
| `-- transcriptInfo.tab
|-- genes.gtf
|-- genes.gtf.bed12
|-- genes.refFlat.txt
|-- genome.chrom.sizes
|-- genome.dict
`-- genome.faproject related files (i.e., metadata & configuration file) as used in this tutorial are located under CRI-BIO-646/
$ ls -l CRI-BIO-646.*
|-- CRI-BIO-646.metadata.txt
|-- CRI-BIO-646.pipeline.yaml
Here are the first few lines in the configuration example file CRI-BIO-646/CRI-BIO-646.pipeline.yaml
---
pipeline:
flags:
aligners:
run_star: True
quantifiers:
run_featurecounts: True
run_rsem: False
run_kallisto: False
callers:
run_edger: True
run_deseq2: True
run_limma: True
software:
main:
use_module: 0
adapter_pe: AGATCGGAAGAGCGGTTCAG,AGATCGGAAGAGCGTCGTGT
adapter_se: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
fastq_format: 33
genome_assembly: hg38When running on another dataset, you will need to modify these
two filesand the master pipeline script (i.e.,Build_RNAseq.CRI-BIO-646.sh) (as described below) accordingly.For instance, if you would like to turn off the DE analysis tool limma, you can set the respecitve paramter to ‘False’ in configuration file
- run_limma: False
For metadata file, you might pay attendtion on the following settings
- Single End (SE) Library
- Set Flavor column as 1xReadLength (e.g., 1x50)
- Set Seqfile1 column as the file name of the repective sequencing file
- Paired End (PE) Library
- Set Flavor column as 2xReadLength (e.g., 2x50)
- Set Seqfile1 column as the file name of the repective read 1 (R1) sequencing file
- Set an additional column named ‘Seqfile2’ as the file name of the repective read 2 (R2) sequencing file
- Non strand-specific Library
- Set LibType column to NS
- Strand-specific Library
- Inquire the library type from your seuqencing center and set LibType column to FR (the left-most end of the fragment (in transcript coordinates, or the first-strand synthesis) is the first sequenced) or RF (the right-most end of the fragment (in transcript coordinates) is the first sequenced, or the second-strand synthesis). You can read this blog for more details of strand-specific RNA-seq.
Master pipeline script
$ cat Build_RNAseq.CRI-BIO-646.sh
## build pipeline scripts
now=$(date +"%m-%d-%Y_%H:%M:%S")
## project info
project="CRI-BIO-646"
SubmitRNAseqExe="Submit_${PWD##*/}.sh"
padding="CRI-BIO-646/"
## command
echo "START" `date` " Running build_rnaseq.py"
python3 SRC/Python/build_rnaseq.py \
--projdir $PWD \
--metadata $PWD/${padding}$project.metadata.txt \
--config $PWD/${padding}$project.pipeline.yaml \
--systype cluster \
--threads 8 \
--log_file $PWD/Build_RNAseq.$project.$now.log
## submit pipeline master script
echo "START" `date` " Running $SubmitRNAseqExe"
echo "bash $SubmitRNAseqExe"
echo "END" `date`
Basically, when running on your own dataset, you will need to modify this master pipeline script (i.e.,
Build_RNAseq.CRI-BIO-646.sh) accordingly.For instance, you can change respective parameters as follows.
- project=“PROJECT_AS_PREFIX” (e.g., CRI-BIO-646 which is used as a prefix of metadata file CRI-BIO-646.metadata.txt and configuration file CRI-BIO-646.pipeline.yaml)
- padding=“DIRECTORY_NAME_CONTAINING_PROJECT_DATA” (e.g., CRI-BIO-646 which is the folder name to accommodate metadata file, configuration file, sequencing data folder, and references folder)
Before running, you need to know
The master BigDataScript script can be
ONLY run on a head/entry nodeother than any other compuatation node. But, you can step by step run individual sub-task bash scripts in any computation node interactively.
Generate sub-task scirpts of the pipepline
# load modules
$ module purge;module load gcc udunits R/3.5.0 python/3.6.0; module update
# This step is optional but it will install all necessary R packages ahead.
# In case the pipeline was terminated due to the failure of R package installation later when running the pipeline.
# A successful execution result will show the package versions of all necessary packages from devtools to pheatmap
$ Rscript --vanilla SRC/R/util/prerequisite.packages.R
# create directories and generate all necessary scripts
$ bash Build_RNAseq.CRI-BIO-646.sh
this step will execute SRC/Python/build_rnaseq.py using python3 to generate all sub-task bash scripts and directories according to the provided metadata and configuration files (i.e., CRI-BIO-646/CRI-BIO-646.metadata.txt and CRI-BIO-646/CRI-BIO-646.pipeline.yaml )
$ tree -d RNAseq
|-- Aln
| `-- star
| |-- KO01
| | |-- BK-AA-4_S14_L007_R355
| | `-- BK-AA-4_S14_L007_R357
| |-- KO02
| | |-- BK-AA-5_S15_L007_R355
| | `-- BK-AA-5_S15_L007_R357
| |-- KO03
| | |-- BK-AA-6_S16_L007_R355
| | `-- BK-AA-6_S16_L007_R357
| |-- WT01
| | |-- BK-AA-1_S11_L007_R355
| | `-- BK-AA-1_S11_L007_R357
| |-- WT02
| | |-- BK-AA-2_S12_L007_R355
| | `-- BK-AA-2_S12_L007_R357
| `-- WT03
| |-- BK-AA-3_S13_L007_R355
| `-- BK-AA-3_S13_L007_R357
|-- AlnQC
| |-- picard
| | `-- star
| | |-- KO01
| | | `-- tmp
| | |-- KO02
| | | `-- tmp
| | |-- KO03
| | | `-- tmp
| | |-- WT01
| | | `-- tmp
| | |-- WT02
| | | `-- tmp
| | `-- WT03
| | `-- tmp
| `-- rseqc
| `-- star
| |-- KO01
| | `-- tmp
| |-- KO02
| | `-- tmp
| |-- KO03
| | `-- tmp
| |-- WT01
| | `-- tmp
| |-- WT02
| | `-- tmp
| `-- WT03
| `-- tmp
|-- DEG
| |-- deseq2
| | `-- featurecounts
| | `-- star
| |-- edger
| | `-- featurecounts
| | `-- star
| `-- limma
| `-- featurecounts
| `-- star
|-- LociStat
| `-- featurecounts
| `-- star
| `-- CRI-BIO-646-BMB-RKeenan
|-- PostAna
| |-- clusterprofiler
| | `-- featurecounts
| | `-- star
| | `-- CRI-BIO-646-BMB-RKeenan
| `-- pheatmap
| `-- featurecounts
| `-- star
| `-- CRI-BIO-646-BMB-RKeenan
|-- QuantQC
| `-- featurecounts
| `-- star
|-- Quantification
| `-- featurecounts
| `-- star
| |-- KO01
| |-- KO02
| |-- KO03
| |-- WT01
| |-- WT02
| `-- WT03
|-- RawReadQC
| |-- KO01
| | |-- BK-AA-4_S14_L007_R355
| | | `-- BK-AA-4_S14_L007_R1_001_fastqc
| | | |-- Icons
| | | `-- Images
| | `-- BK-AA-4_S14_L007_R357
| | `-- BK-AA-4_S14_L007_R1_001_fastqc
| | |-- Icons
| | `-- Images
| |-- KO02
| | |-- BK-AA-5_S15_L007_R355
| | | `-- BK-AA-5_S15_L007_R1_001_fastqc
| | | |-- Icons
| | | `-- Images
| | `-- BK-AA-5_S15_L007_R357
| | `-- BK-AA-5_S15_L007_R1_001_fastqc
| | |-- Icons
| | `-- Images
| |-- KO03
| | |-- BK-AA-6_S16_L007_R355
| | | `-- BK-AA-6_S16_L007_R1_001_fastqc
| | | |-- Icons
| | | `-- Images
| | `-- BK-AA-6_S16_L007_R357
| | `-- BK-AA-6_S16_L007_R1_001_fastqc
| | |-- Icons
| | `-- Images
| |-- WT01
| | |-- BK-AA-1_S11_L007_R355
| | | `-- BK-AA-1_S11_L007_R1_001_fastqc
| | | |-- Icons
| | | `-- Images
| | `-- BK-AA-1_S11_L007_R357
| | `-- BK-AA-1_S11_L007_R1_001_fastqc
| | |-- Icons
| | `-- Images
| |-- WT02
| | |-- BK-AA-2_S12_L007_R355
| | | `-- BK-AA-2_S12_L007_R1_001_fastqc
| | | |-- Icons
| | | `-- Images
| | `-- BK-AA-2_S12_L007_R357
| | `-- BK-AA-2_S12_L007_R1_001_fastqc
| | |-- Icons
| | `-- Images
| `-- WT03
| |-- BK-AA-3_S13_L007_R355
| | `-- BK-AA-3_S13_L007_R1_001_fastqc
| | |-- Icons
| | `-- Images
| `-- BK-AA-3_S13_L007_R357
| `-- BK-AA-3_S13_L007_R1_001_fastqc
| |-- Icons
| `-- Images
`-- shell_scriptsExecute the entire analysis with just one command
# This will start to run the entire pipeline.
# You can chekc teh BDS report to know the running status.
$ bash Submit_CRI-BIO-646-BMB-RKeenan.sh
Submit_CRI-BIO-646-BMB-RKeenan.bds, so you will need to run this command on a head/entry node.Check running status
- $ qstat -a
- tail *.bds.log
For the first step, the pipeline will perform quality assessment on the raw fastq files.
The BDS code snippet for the sample KO01 will look like:
$ grep -A1 run.RawReadQC.FastQC.BK-AA-4_S14_L007_R355.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/RawReadQC/KO01/BK-AA-4_S14_L007_R355/BK-AA-4_S14_L007_R1_001_fastqc.zip' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ/BK-AA-4_S14_L007_R1_001.fastq.gz' ], cpus := 1, mem := 16*G, timeout := 72*hour, taskName := "FastQC.BK-AA-4_S14_L007_R355") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.RawReadQC.FastQC.BK-AA-4_S14_L007_R355.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/RawReadQC/KO01/BK-AA-4_S14_L007_R355/BK-AA-4_S14_L007_R1_001_fastqc.zip' ] )
This code chunk will invoke the bash script RNAseq/shell_scripts/run.RawReadQC.FastQC.BK-AA-4_S14_L007_R355.sh to execute FastQC on the KO01(BK-AA-1_S11_L007_R355) sequencing library.
After the completion of the entire pipeline, you can check FastQC report per individual libraries; for instance, the partial report of KO01 will be as follows or a full report.
You can check FastQC Help for more details about how to interpret a FastQC report.
Or, compare your reports to the example reports provided by FastQC for a Good Illumina Data or Bad Illumina Data.
In this step, the pipeline will conduct read alignment on the raw fastq files.
The BDS code snippet for the sample KO01 will look like:
$ grep -A1 run.alignRead.star.BK-AA-4_S14_L007_R355.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/BK-AA-4_S14_L007_R355/BK-AA-4_S14_L007_R355.star.bam' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ/BK-AA-4_S14_L007_R1_001.fastq.gz' ], cpus := 4, mem := 64*G, timeout := 72*hour, taskName := "star.BK-AA-4_S14_L007_R355") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.alignRead.star.BK-AA-4_S14_L007_R355.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/BK-AA-4_S14_L007_R355/BK-AA-4_S14_L007_R355.star.bam' ] )
This code chunk will invoke the bash script RNAseq/shell_scripts/run.alignRead.star.BK-AA-4_S14_L007_R355.sh to execute STAR on the KO01(BK-AA-4_S14_L007_R355) sequencing library.
After the completion of the entire pipeline, you can check the alignment result of each individual libraries; for instance, the result of KO01(BK-AA-4_S14_L007_R355) will be as follows.
$ tree RNAseq/Aln/star/KO01/BK-AA-4_S14_L007_R355
RNAseq/Aln/star/KO01/BK-AA-4_S14_L007_R355
|-- BK-AA-4_S14_L007_R355.star.Aligned.sortedByCoord.out.bam
|-- BK-AA-4_S14_L007_R355.star.Log.final.out
|-- BK-AA-4_S14_L007_R355.star.Log.out
|-- BK-AA-4_S14_L007_R355.star.Log.progress.out
|-- BK-AA-4_S14_L007_R355.star.SJ.out.tab
|-- BK-AA-4_S14_L007_R355.star.Unmapped.out.mate1
|-- BK-AA-4_S14_L007_R355.star.bai
|-- BK-AA-4_S14_L007_R355.star.bam -> BK-AA-4_S14_L007_R355.star.Aligned.sortedByCoord.out.bam
`-- run.alignRead.star.BK-AA-4_S14_L007_R355.log
You can check a log file (e.g., RNAseq/Aln/star/KO01/BK-AA-4_S14_L007_R355/BK-AA-4_S14_L007_R355.star.Log.final.out) for more alignment information provided by STAR.
$ cat RNAseq/Aln/star/KO01/BK-AA-4_S14_L007_R355/BK-AA-4_S14_L007_R355.star.Log.final.out
## Started job on | Nov 07 12:19:19
## Started mapping on | Nov 07 12:19:34
## Finished on | Nov 07 12:21:59
## Mapping speed, Million of reads per hour | 430.40
##
## Number of input reads | 17335560
## Average input read length | 50
## UNIQUE READS:
## Uniquely mapped reads number | 14809963
## Uniquely mapped reads % | 85.43%
## Average mapped length | 49.84
## Number of splices: Total | 2272172
## Number of splices: Annotated (sjdb) | 2259453
## Number of splices: GT/AG | 2249028
## Number of splices: GC/AG | 19534
## Number of splices: AT/AC | 2520
## Number of splices: Non-canonical | 1090
## Mismatch rate per base, % | 0.20%
## Deletion rate per base | 0.00%
## Deletion average length | 1.44
## Insertion rate per base | 0.00%
## Insertion average length | 1.28
## MULTI-MAPPING READS:
## Number of reads mapped to multiple loci | 0
## % of reads mapped to multiple loci | 0.00%
## Number of reads mapped to too many loci | 2403663
## % of reads mapped to too many loci | 13.87%
## UNMAPPED READS:
## % of reads unmapped: too many mismatches | 0.00%
## % of reads unmapped: too short | 0.56%
## % of reads unmapped: other | 0.15%
## CHIMERIC READS:
## Number of chimeric reads | 0
## % of chimeric reads | 0.00%
In this step, the pipeline will conduct a QC on alignment result.
The BDS code snippets for the sample KO01 will look like:
$ grep -A1 run.alnQC.*.star.KO01.*.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics.pdf' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/KO01.star.bai' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "picard.star.KO01") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.alnQC.picard.star.KO01.CollectRnaSeqMetrics.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics.pdf' ] )
--
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.xls', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.r', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.pdf' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/KO01.star.bai' ], cpus := 8, mem := 64*G, timeout := 72*hour, taskName := "rseqc.star.KO01") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.alnQC.rseqc.star.KO01.clipping_profile.py.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.xls', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.r', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.pdf' ] )
--
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.infer_experiment.txt' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/KO01.star.bai' ], cpus := 8, mem := 64*G, timeout := 72*hour, taskName := "rseqc.star.KO01") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.alnQC.rseqc.star.KO01.infer_experiment.py.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.infer_experiment.txt' ] )
This code chunk will invoke few bash scripts (e.g., RNAseq/shell_scripts/run.alnQC.picard.star.KO01.CollectRnaSeqMetrics.sh, run.alnQC.rseqc.star.KO01.clipping_profile.py.sh, and run.alnQC.rseqc.star.KO01.infer_experiment.py.sh) to execute alignment QC tools (i.e., Picard and RSeQC) on the sample KO01.
After the completion of the entire pipeline, you can check the alignment QC results of each individual samples; for instance, the results of KO01 will be as follows.
$ tree RNAseq/AlnQC/*/star/KO01
RNAseq/AlnQC/picard/star/KO01
|-- KO01.star.picard.RNA_Metrics
|-- KO01.star.picard.RNA_Metrics.pdf
|-- run.alnQC.picard.star.KO01.CollectRnaSeqMetrics.log
`-- tmp
RNAseq/AlnQC/rseqc/star/KO01
|-- KO01.star.rseqc.clipping_profile.pdf
|-- KO01.star.rseqc.clipping_profile.r
|-- KO01.star.rseqc.clipping_profile.xls
|-- KO01.star.rseqc.infer_experiment.txt
|-- run.alnQC.rseqc.star.KO01.clipping_profile.py.log
`-- tmp
You can check alignment statistics (e.g., RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics) for more information provided by Picard.
$ head RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics
## ## htsjdk.samtools.metrics.StringHeader
## # picard.analysis.CollectRnaSeqMetrics REF_FLAT=/gpfs/data/bioinformatics/cri_rnaseq_2018/example/references/v28_92_GRCh38.p12/genes.refFlat.txt RIBOSOMAL_INTERVALS=/gpfs/data/bioinformatics/cri_rnaseq_2018/example/references/v28_92_GRCh38.p12/GRCh38_rRNA.bed.interval_list STRAND_SPECIFICITY=SECOND_READ_TRANSCRIPTION_STRAND CHART_OUTPUT=/gpfs/data/bioinformatics/wchan10/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics.pdf METRIC_ACCUMULATION_LEVEL=[SAMPLE, ALL_READS] INPUT=/gpfs/data/bioinformatics/wchan10/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/KO01.star.bam OUTPUT=/gpfs/data/bioinformatics/wchan10/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics TMP_DIR=[/gpfs/data/bioinformatics/wchan10/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/picard/star/KO01/tmp] MINIMUM_LENGTH=500 RRNA_FRAGMENT_PERCENTAGE=0.8 ASSUME_SORTED=true STOP_AFTER=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
## ## htsjdk.samtools.metrics.StringHeader
## # Started on: Wed Nov 07 12:32:15 CST 2018
##
## ## METRICS CLASS picard.analysis.RnaSeqMetrics
## PF_BASES PF_ALIGNED_BASES RIBOSOMAL_BASES CODING_BASES UTR_BASES INTRONIC_BASES INTERGENIC_BASES IGNORED_READS CORRECT_STRAND_READS INCORRECT_STRAND_READS PCT_RIBOSOMAL_BASES PCT_CODING_BASES PCT_UTR_BASES PCT_INTRONIC_BASES PCT_INTERGENIC_BASES PCT_MRNA_BASES PCT_USABLE_BASES PCT_CORRECT_STRAND_READS MEDIAN_CV_COVERAGE MEDIAN_5PRIME_BIAS MEDIAN_3PRIME_BIAS MEDIAN_5PRIME_TO_3PRIME_BIAS SAMPLE LIBRARY READ_GROUP
## 1582833150 1577842710 334200 876931136 608664079 71989244 19924828 0 4418837 102189 0.000212 0.555779 0.385757 0.045625 0.012628 0.941536 0.938567 0.977397 0.904811 0.026924 0.99656 0.041207
## 1582833150 1577842710 334200 876931136 608664079 71989244 19924828 0 4418837 102189 0.000212 0.555779 0.385757 0.045625 0.012628 0.941536 0.938567 0.977397 0.904811 0.026924 0.99656 0.041207 unknown
Or, the respective coverage plot of the sample KO01 produced by Picard will be as follows.
There are two alignment measurements performed using RSeQC.
The results will be as follows. Please check the RSeQC website for more measurements and details.
$cat RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.infer_experiment.txt
##
##
## This is SingleEnd Data
## Fraction of reads failed to determine: 0.0757
## Fraction of reads explained by "++,--": 0.0179
## Fraction of reads explained by "+-,-+": 0.9064
Here, you can confirm again the sample KO01 is a single-end, strand-specific library using the second-strand synthesis.
In this step, the pipeline will conduct expression quantification over alignments.
The BDS code snippet for the sample KO01 will look like:
$ grep -A1 run.quant.featurecounts.star.KO01.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/KO01.star.bai' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "featurecounts.star.KO01") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.quant.featurecounts.star.KO01.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count' ] )
This code chunk will invoke the bash script (e.g., RNAseq/shell_scripts/run.quant.featurecounts.star.KO01.sh) to execute expression quantification tool (i.e., Subread::featureCounts on the sample KO01.
After the completion of the entire pipeline, you can check the quantification results of each individual samples; for instance, the results of KO01 will be as follows.
$ tree RNAseq/Quantification/featurecounts/star/KO01
RNAseq/Quantification/featurecounts/star/KO01
|-- KO01.star.featurecounts.count
|-- KO01.star.featurecounts.count.jcounts
|-- KO01.star.featurecounts.count.summary
`-- run.quant.featurecounts.star.KO01.log
You can check quantification statistics (e.g., RNAseq/Quantification/featurecounts/star/KO01KO01.star.featurecounts.count.summary) for more information provided by Subread::featureCounts
$ cat RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count.summary
## Status /gpfs/data/bioinformatics/wchan10/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/KO01.star.bam
## Assigned 28638683
## Unassigned_Unmapped 0
## Unassigned_MappingQuality 0
## Unassigned_Chimera 0
## Unassigned_FragmentLength 0
## Unassigned_Duplicate 0
## Unassigned_MultiMapping 0
## Unassigned_Secondary 0
## Unassigned_Nonjunction 0
## Unassigned_NoFeatures 2036590
## Unassigned_Overlapping_Length 0
## Unassigned_Ambiguity 981390
Or, the top 10 most abundant genes in the sample KO01 will be as follows.
$ cat <(head -n2 RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count | tail -n+2 | cut -f1,7) <(cut -f1,7 RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count | sort -k2,2nr | head)
| Geneid | Chr | Start | End | Strand | Length | KO01 |
|---|---|---|---|---|---|---|
| ENSG00000210082.2 | chrM | 1671 | 3229 |
|
1559 | 129707 |
| ENSG00000198804.2 | chrM | 5904 | 7445 |
|
1542 | 129552 |
| ENSG00000167658.15 | chr19 | 3976056 | 3985469 |
|
4027 | 105680 |
| ENSG00000229807.10 | chrX | 73820651 | 73851091 |
|
19961 | 99939 |
| ENSG00000074800.14 | chr1 | 8861000 | 8879190 |
|
5276 | 88419 |
| ENSG00000198886.2 | chrM | 10760 | 12137 |
|
1378 | 86567 |
| ENSG00000111640.14 | chr12 | 6533927 | 6538371 |
|
2981 | 79933 |
| ENSG00000184009.10 | chr17 | 81509971 | 81523847 |
|
2993 | 75963 |
| ENSG00000198712.1 | chrM | 7586 | 8269 |
|
684 | 70428 |
| ENSG00000067225.17 | chr15 | 72199029 | 72231822 |
|
8571 | 69910 |
In this step, the pipeline will identify differentially expressed genes (DEG) according to the alignment result files (i.e., BAM files) after the alignment step.
The BDS code snippets for the example dataset will look like:
$ grep -A1 run.call.*.featurecounts.star.*.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/edger/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.count.txt', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/edger/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.test.DEG.txt' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT01/WT01.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT02/WT02.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT03/WT03.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO02/KO02.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO03/KO03.star.featurecounts.count' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "edger.featurecounts.star.CRI-BIO-646-BMB-RKeenan") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.call.edger.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/edger/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.count.txt' ] )
--
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/deseq2/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.txt', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/deseq2/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.test.DEG.txt' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT01/WT01.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT02/WT02.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT03/WT03.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO02/KO02.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO03/KO03.star.featurecounts.count' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "deseq2.featurecounts.star.CRI-BIO-646-BMB-RKeenan") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.call.deseq2.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/deseq2/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.txt' ] )
--
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/limma/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.count.txt', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/limma/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.test.DEG.txt' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT01/WT01.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT02/WT02.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT03/WT03.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO02/KO02.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO03/KO03.star.featurecounts.count' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "limma.featurecounts.star.CRI-BIO-646-BMB-RKeenan") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.call.limma.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/limma/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.count.txt' ] )
This code chunk will invoke few bash scripts (e.g., RNAseq/shell_scripts/run.call.edger.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh, run.call.deseq2.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh, and run.call.limma.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh) to execute differential expression (DE) analysis using three the state-of-the-art tools (i.e., edgeR, DESeq2, and limma) on the example dataset of six samples from KO01 to WT03.
There are three DE analysis tools used in the current pipeline, including
After the completion of the entire pipeline, you can check the calling results of each individual methods; for instance, the analysis results of the example dataset will be as follows.
$ tree RNAseq/DEG/*/featurecounts/star/
RNAseq/DEG/deseq2/featurecounts/star/
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.RData
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.ntd.meanSdPlot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.ntd.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.rld.meanSdPlot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.rld.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.vst.meanSdPlot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.vst.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.plotDispEsts.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.plotMA.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.test.DEG.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.test.txt
`-- run.call.deseq2.featurecounts.star.CRI-BIO-646-BMB-RKeenan.log
RNAseq/DEG/edger/featurecounts/star/
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.RData
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.count.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.plotBCV.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.plotMA.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.plotSmear.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.test.DEG.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.test.txt
`-- run.call.edger.featurecounts.star.CRI-BIO-646-BMB-RKeenan.log
RNAseq/DEG/limma/featurecounts/star/
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.RData
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.count.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.count.voom.meanSdPlot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.plotMA.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.test.DEG.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.test.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.voom.mean-variance.pdf
`-- run.call.limma.featurecounts.star.CRI-BIO-646-BMB-RKeenan.log
You can check statistical test results per gene (e.g., RNAseq/DEG/deseq2/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.test.txt) for more information generated by each method.
$ head RNAseq/DEG/deseq2/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.test.txt
| Geneid | baseMean | log2FoldChange | lfcSE | stat | pvalue | FDR | DEG | |
|---|---|---|---|---|---|---|---|---|
| 8943 | ENSG00000139219.18 | 7593.6731 | 3.364988 | 0.0909209 | 37.01005 | 0 | 0 | 1 |
| 9845 | ENSG00000092067.5 | 3022.5103 | 3.543611 | 0.0958003 | 36.98957 | 0 | 0 | 1 |
| 14075 | ENSG00000101160.13 | 1270.4139 | 3.254167 | 0.1034734 | 31.44932 | 0 | 0 | 1 |
| 13557 | ENSG00000142530.10 | 1241.2705 | -3.737343 | 0.1221921 | -30.58581 | 0 | 0 | -1 |
| 9650 | ENSG00000136167.13 | 2428.1942 | 2.931033 | 0.0974567 | 30.07524 | 0 | 0 | 1 |
| 8362 | ENSG00000084207.16 | 1082.8763 | 3.350925 | 0.1188343 | 28.19830 | 0 | 0 | 1 |
| 9368 | ENSG00000111331.12 | 575.0466 | 3.131198 | 0.1228025 | 25.49785 | 0 | 0 | 1 |
| 5104 | ENSG00000135346.8 | 374.1435 | -5.235472 | 0.2229213 | -23.48574 | 0 | 0 | -1 |
| 14910 | ENSG00000165194.15 | 1085.3843 | -2.145027 | 0.0973406 | -22.03630 | 0 | 0 | -1 |
| 2440 | ENSG00000163053.10 | 2297.1436 | -1.763135 | 0.0818927 | -21.52983 | 0 | 0 | -1 |
Or, the exploratory plots of the example dataset produced by DESeq2 will be as follows.
In this step, the pipeline will collect DEG statistics and identify the overlapping set of identified DEGs from the previous methods.
The BDS code snippet for the sample KO01 will look like:
$ grep -A1 run.lociStat.featurecounts.star.*.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/LociStat/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.overlap.txt' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/edger/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.test.DEG.txt', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/deseq2/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.test.DEG.txt', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/limma/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.test.DEG.txt' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "lociStat.featurecounts.star.CRI-BIO-646-BMB-RKeenan") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.lociStat.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/LociStat/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.overlap.txt' ] )
This code chunk will invoke the bash script (e.g., RNAseq/shell_scripts/run.lociStat.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh) to collect DEG statistics and to make a Venn diagram plot.
After the completion of the entire pipeline, you can check the statistics result of DEGs per method; for instance, the example dataset CRI-BIO-646 will be as follows.
$ grep -A5 'Up/Down regulated DEGs per methods' RNAseq/LociStat/featurecounts/star/*/run.lociStat.featurecounts.star.*.log | tail -n+2
## grep: result/run.lociStat.featurecounts.star.*.log: No such file or directory
$ cut -f1,2,4 RNAseq/LociStat/featurecounts/star/*/*.star.featurecounts.VennList.txt
| Methods | Method.Num | ID.Num |
|---|---|---|
| edger | 1 | 16 |
| deseq2 | 1 | 3 |
| limma | 1 | 68 |
| edger&deseq2 | 2 | 45 |
| edger&limma | 2 | 107 |
| deseq2&limma | 2 | 3 |
| edger&deseq2&limma | 3 | 2669 |
There is a Venn diagram plot will be generated after this step.
In this step, the pipeline will make a PCA plot based on the transcriptional profiling of all samples.
The BDS code snippet for the sample KO01 will look like:
$ grep -A1 run.quantQC.pca.featurecounts.star.*.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/QuantQC/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.pca.pdf' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/deseq2/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.txt' ], cpus := 8, mem := 64*G, timeout := 72*hour, taskName := "pca.featurecounts.star.CRI-BIO-646-BMB-RKeenan") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.quantQC.pca.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/QuantQC/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.pca.pdf' ] )
This code chunk will invoke the bash script (e.g., RNAseq/shell_scripts/run.quantQC.pca.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh) to make a PCA plot based on the alignment quantification result generated by DESeq2 or one of DE analysis tools.
After the completion of the entire pipeline, you can check the PCA plot under the folder of QuantQC/.
In this step, the pipeline will make a heat map based on the overlapping set of DEGs identified across different DE analysis tools.
The BDS code snippet for the sample KO01 will look like:
$ grep -A1 run.postAna.pheatmap.featurecounts.star.*.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/PostAna/pheatmap/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.heatmap.pdf' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/LociStat/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.overlap.txt' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "postAna.pheatmap.featurecounts.star.CRI-BIO-646-BMB-RKeenan") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.postAna.pheatmap.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/PostAna/pheatmap/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.heatmap.pdf' ] )
This code chunk will invoke the bash script (e.g., RNAseq/shell_scripts/run.postAna.pheatmap.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh) to make a heat map plot based on the overlapping set of DEGs identified across different DE analysis tools (i.e., edgeR, DESeq2, and limma).
After the completion of the entire pipeline, you can check the heat map under the folder of PostAna/pheatmap.
In this step, the pipeline will conduct enrichment analysis and make several exploratory plots based on the overlapping set of DEGs identified across different DE analysis tools.
The BDS code snippet for the projet CRI-BIO-646-BMB-RKeenan will look like:
$ grep -A1 run.postAna.clusterprofiler.featurecounts.star.*.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/PostAna/clusterprofiler/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGO.ALL.txt' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/LociStat/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.overlap.txt' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "postAna.clusterprofiler.featurecounts.star.CRI-BIO-646-BMB-RKeenan") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.postAna.clusterprofiler.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/PostAna/clusterprofiler/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGO.ALL.txt' ] )
This code chunk will invoke the bash script (e.g., RNAseq/shell_scripts/run.postAna.clusterprofiler.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh) to conduct enrichment analyses including GO and KEGG pathway erichment analyses as well as gene set enrichment analysis (GSEA).
After the completion of the entire pipeline, you can check the heat map under the folder of PostAna/pheatmap.
$ tree RNAseq/PostAna/clusterprofiler/featurecounts/star/*/
RNAseq/PostAna/clusterprofiler/featurecounts/star/CRI-BIO-646-BMB-RKeenan/
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGO.ALL.cnetplot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGO.ALL.dotplot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGO.ALL.emapplot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGO.ALL.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGSEAGO.ALL.pos001.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGSEAGO.ALL.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichKEGG.cnetplot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichKEGG.dotplot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichKEGG.emapplot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichKEGG.txt
`-- run.postAna.clusterprofiler.featurecounts.star.CRI-BIO-646-BMB-RKeenan.log
Below, several plots will be generated based on the overlapping set of DEGs using GO database as example.
Considering the environment setting in the CRI HPC system, BigDataScript was used as a job management system in the current development to achieve an automatic pipeline. It can handle the execution dependency of all sub-task bash scripts and resume from a failed point, if any.
After the completion of the entire pipeline, you will see a BigDataScript report in HTML under the pipeline folder. For instance, this is the report from one test run. The graphic timeline will tell you the execution time per sub-task script.
[Last Updated on 2018/11/07] | Top